In the first version of our Farsi Text-To-Speech (TTS) system, a Recurrent Neural Network (RNN) was used to generate prosody parameters (pitch contour, duration, energy and pause), and a Harmonic + Noise Model (HNM) speech synthesizer was used to concatenate single diphone units. To improve the performance of the TTS system, this paper presents two modifications. The first is a neural-statistical hybrid model in which the RNN plays the role of prosody parameterizer and a combination of decision trees and Gaussian Mixture Models (GMMs) gives the probability distributions of targets and transitions in each context-equivalent cluster. The second is a unit selection speech synthesizer in which the syllable is the basic synthesis unit and, building on the first modification, an effective unit selection strategy is also developed. To evaluate the performance of the system, the rating scales in Recommendation P.85 of the International Telecommunication Union (ITU) were used, and a Mean Opinion Score (MOS) of 3.6, averaged over six scales, was achieved.
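The abstract does not spell out the unit selection strategy, so the following is only a generic sketch of how unit selection is commonly framed: for each target syllable there are several candidate units, and dynamic programming picks the sequence with the lowest sum of target costs (how well a unit fits the predicted prosody) and join costs (how smoothly adjacent units concatenate). The function name `select_units`, the cost callbacks, and the toy pitch-based costs are all assumptions made for illustration; in the system described here, such costs would presumably be derived from the GMM target and transition distributions, which this sketch does not model.

```python
from typing import Callable, List, Sequence, TypeVar

U = TypeVar("U")  # a candidate unit, e.g. a recorded syllable (hypothetical type)


def select_units(
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[int, U], float],
    join_cost: Callable[[U, U], float],
) -> List[U]:
    """Return one unit per position, minimising total target + join cost."""
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every position needs at least one candidate")

    # best[j]: cheapest cost of any path ending in candidate j of the current position
    best = [target_cost(0, u) for u in candidates[0]]
    back: List[List[int]] = []  # back[t-1][j]: best predecessor index for candidate j at t

    for t in range(1, len(candidates)):
        prev_units = candidates[t - 1]
        new_best: List[float] = []
        pointers: List[int] = []
        for u in candidates[t]:
            costs = [best[i] + join_cost(p, u) for i, p in enumerate(prev_units)]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[i_min] + target_cost(t, u))
            pointers.append(i_min)
        best = new_best
        back.append(pointers)

    # Trace the cheapest path back from the last position to the first.
    j = min(range(len(best)), key=best.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[t][j] for t, j in enumerate(path)]


# Toy usage: units are (label, mean_pitch_hz) pairs; both cost functions are made up.
targets = [100.0, 120.0]
candidates = [[("a1", 100.0), ("a2", 140.0)], [("b1", 180.0), ("b2", 125.0)]]
chosen = select_units(
    candidates,
    target_cost=lambda t, u: abs(u[1] - targets[t]),
    join_cost=lambda p, u: abs(u[1] - p[1]),
)
print([label for label, _ in chosen])
```

With these toy costs the search picks `a1` followed by `b2`. The search itself is independent of how the costs are defined, so probabilistic scores (for example, negative log-likelihoods) could be plugged in through the same two callbacks.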